What’s today?

1. Transformation by whom?

It is ok to practice hard regarding the commands we learned the last time, however I did not emphasize much.

You might wonder why don’t we manipulate the data set and picture it instead of learning microscope details. For example, calculate relative frequency by group variables and just plot them.

I agree. But in this case, you need to learn how to manipulate data.

Let’s interpret the following codes:

setwd("~/Documents/ibs_course/BUS240/data")
load('gss_sm.rda')

head(gss_sm, 10)
## # A tibble: 10 × 32
##     year    id ballot   age childs sibs  degree race  sex   region incom…¹ relig
##    <dbl> <dbl> <labe> <dbl>  <dbl> <lab> <fct>  <fct> <fct> <fct>  <fct>   <fct>
##  1  2016     1 1         47      3 2     Bache… White Male  New E… $17000… None 
##  2  2016     2 2         61      0 3     High … White Male  New E… $50000… None 
##  3  2016     3 3         72      2 3     Bache… White Male  New E… $75000… Cath…
##  4  2016     4 1         43      4 3     High … White Fema… New E… $17000… Cath…
##  5  2016     5 3         55      2 2     Gradu… White Fema… New E… $17000… None 
##  6  2016     6 2         53      2 2     Junio… White Fema… New E… $60000… None 
##  7  2016     7 1         50      2 2     High … White Male  New E… $17000… None 
##  8  2016     8 3         23      3 6     High … Other Fema… Middl… $30000… Cath…
##  9  2016     9 1         45      3 5     High … Black Male  Middl… $60000… Prot…
## 10  2016    10 3         71      4 1     Junio… White Male  Middl… $60000… None 
## # … with 20 more variables: marital <fct>, padeg <fct>, madeg <fct>,
## #   partyid <fct>, polviews <fct>, happy <fct>, partners <fct>, grass <fct>,
## #   zodiac <fct>, pres12 <labelled>, wtssall <dbl>, income_rc <fct>,
## #   agegrp <fct>, ageq <fct>, siblings <fct>, kids <fct>, religion <fct>,
## #   bigregion <fct>, partners_rc <fct>, obama <dbl>, and abbreviated variable
## #   name ¹​income16
# practice 1
religion_count <- gss_sm %>%
  group_by(religion) %>%
  summarize(N = n()) %>% arrange(-N)
religion_count
## # A tibble: 6 × 2
##   religion       N
##   <fct>      <int>
## 1 Protestant  1371
## 2 Catholic     649
## 3 None         619
## 4 Other        159
## 5 Jewish        51
## 6 <NA>          18
# practice 2
rel_by_region <- gss_sm %>%
  group_by(bigregion, religion) %>%
  summarize(N = n()) %>%
  mutate(freq = N/sum(N), pct = round((freq*100),0))
## `summarise()` has grouped output by 'bigregion'. You can override using the
## `.groups` argument.
rel_by_region
## # A tibble: 24 × 5
## # Groups:   bigregion [4]
##    bigregion religion       N    freq   pct
##    <fct>     <fct>      <int>   <dbl> <dbl>
##  1 Northeast Protestant   158 0.324      32
##  2 Northeast Catholic     162 0.332      33
##  3 Northeast Jewish        27 0.0553      6
##  4 Northeast None         112 0.230      23
##  5 Northeast Other         28 0.0574      6
##  6 Northeast <NA>           1 0.00205     0
##  7 Midwest   Protestant   325 0.468      47
##  8 Midwest   Catholic     172 0.247      25
##  9 Midwest   Jewish         3 0.00432     0
## 10 Midwest   None         157 0.226      23
## # … with 14 more rows

Need to check whether the grouping is correct:

rel_by_region %>% group_by(bigregion) %>% summarize(total = sum(pct))
## # A tibble: 4 × 2
##   bigregion total
##   <fct>     <dbl>
## 1 Northeast   100
## 2 Midwest     101
## 3 South       100
## 4 West        101

Now plot it:

p <- ggplot(rel_by_region, aes(x = bigregion, y = pct, fill = religion))
# what is the role of this line?
p+geom_col() # why not using geom_bar?

p + geom_col(position = "dodge") +
    labs(x = "Region", y = "Percent", fill = "Religion")+
    theme(legend.position = "top")

p + geom_col(position = "dodge2") 

p <- ggplot(rel_by_region, aes(x = religion, y = pct, fill = religion))
p + geom_col(position = "dodge") 

p <- ggplot(rel_by_region, aes(x = religion, y = pct, fill = religion))
p + geom_col(position = "dodge") + guides(fill = "none") + coord_flip()

p <- ggplot(rel_by_region, aes(x = religion, y = pct, fill = religion))
p + geom_col(position = "dodge") + guides(fill = "none") + coord_flip() +
    facet_wrap(~ bigregion) 

The coord_flip() function switches the x and y axes after the plot is made. It does not remap variables to aesthetics.

Which way is better?

2.

Let’s use different dataset.

setwd("~/Documents/ibs_course/BUS240/data")
load('organdata.rda')
head(organdata, 10)
## # A tibble: 10 × 21
##    country  year       donors   pop pop_d…¹   gdp gdp_lag health healt…² pubhe…³
##    <chr>    <date>      <dbl> <int>   <dbl> <int>   <int>  <dbl>   <dbl>   <dbl>
##  1 Austral… NA          NA    17065   0.220 16774   16591   1300    1224     4.8
##  2 Austral… 1991-01-01  12.1  17284   0.223 17171   16774   1379    1300     5.4
##  3 Austral… 1992-01-01  12.4  17495   0.226 17914   17171   1455    1379     5.4
##  4 Austral… 1993-01-01  12.5  17667   0.228 18883   17914   1540    1455     5.4
##  5 Austral… 1994-01-01  10.2  17855   0.231 19849   18883   1626    1540     5.4
##  6 Austral… 1995-01-01  10.2  18072   0.233 21079   19849   1737    1626     5.5
##  7 Austral… 1996-01-01  10.6  18311   0.237 21923   21079   1846    1737     5.6
##  8 Austral… 1997-01-01  10.3  18518   0.239 22961   21923   1948    1846     5.7
##  9 Austral… 1998-01-01  10.5  18711   0.242 24148   22961   2077    1948     5.9
## 10 Austral… 1999-01-01   8.67 18926   0.244 25445   24148   2231    2077     6.1
## # … with 11 more variables: roads <dbl>, cerebvas <int>, assault <int>,
## #   external <int>, txp_pop <dbl>, world <chr>, opt <chr>, consent_law <chr>,
## #   consent_practice <chr>, consistent <chr>, ccode <chr>, and abbreviated
## #   variable names ¹​pop_dens, ²​health_lag, ³​pubhealth

It contains a little more than a decade’s worth of information on the donation of organs for transplants in seventeen OECD countries.

country. Country name.

year. Year.

donors. Organ Donation rate per million population.

pop. Population in thousands.

pop_dens. Population density per square mile.

gdp. Gross Domestic Product in thousands of PPP dollars.

gdp_lag. Lagged Gross Domestic Product in thousands of PPP dollars.

health. Health spending, thousands of PPP dollars per capita.

health_lag Lagged health spending, thousands of PPP dollars per capita.

pubhealth. Public health spending as a percentage of total expenditure.

roads. Road accident fatalities per 100,000 population.

cerebvas. Cerebrovascular deaths per 100,000 population (rounded).

assault. Assault deaths per 100,000 population (rounded).

external. Deaths due to external causes per 100,000 population.

txp_pop. Transplant programs per million population.

world. Welfare state world (Esping Andersen.)

opt. Opt-in policy or Opt-out policy.

consent_law. Consent law, informed or presumed.

consent_practice. Consent practice, informed or presumed.

consistent. Law consistent with practice, yes or no.

ccode. Abbreviated country code.

Let’s interpret the following commands

organdata %>% select(1:6) %>% sample_n(size = 10)
## # A tibble: 10 × 6
##    country        year       donors   pop pop_dens   gdp
##    <chr>          <date>      <dbl> <int>    <dbl> <int>
##  1 Switzerland    1995-01-01   13    7041   17.1   26304
##  2 Australia      1995-01-01   10.2 18072    0.233 21079
##  3 Switzerland    2000-01-01   14    7184   17.4   29837
##  4 Denmark        1997-01-01   11.4  5285   12.3   24676
##  5 United Kingdom NA           NA   57238   23.6   16228
##  6 Germany        NA           NA      NA   NA        NA
##  7 Australia      NA           NA   17065    0.220 16774
##  8 Italy          1999-01-01   13.7 57646   19.1   23729
##  9 Ireland        1999-01-01   18.7  3756    5.35  25936
## 10 Spain          1994-01-01   25   39166    7.74  15024

Lets’s start by naively graphing some of the data. We can take a look at a scatterplot of donors vs year.

p <- ggplot(organdata,aes(year,donors)) # when you practice, put all the argument names in
p + geom_point()
## Warning: Removed 34 rows containing missing values (geom_point).

The graph looks too rough.

p <- ggplot(data = organdata,
            mapping = aes(x = year, y = donors))
p + geom_line(aes(group = country))
## Warning: Removed 34 row(s) containing missing values (geom_path).

Oh faceting. But need to check

table(organdata$country)
## 
##      Australia        Austria        Belgium         Canada        Denmark 
##             14             14             14             14             14 
##        Finland         France        Germany        Ireland          Italy 
##             14             14             14             14             14 
##    Netherlands         Norway          Spain         Sweden    Switzerland 
##             14             14             14             14             14 
## United Kingdom  United States 
##             14             14
p <- ggplot(data = organdata,
            mapping = aes(x = year, y = donors))
p + geom_line(aes(group = country)) + facet_wrap(~ country)
## Warning: Removed 34 row(s) containing missing values (geom_path).

Let’s focus on the country-level variation ignoring time variation now. We can use geom_boxplot() to get a picture of variation by year across countries. Just as geom_bar() by default calculates a count of observations by the category you map to x, the stat_boxplot() function that works with geom_boxplot() will calculate a number of statistics that allow the box and whiskers to be drawn. We tell geom_boxplot() the variable we want to categorize by (here, country) and the continuous variable we want summarized (here, donors)

p <- ggplot(data = organdata, mapping = aes(x = country, y = donors))
p + geom_boxplot()
## Warning: Removed 34 rows containing non-finite values (stat_boxplot).

p <- ggplot(data = organdata, mapping = aes(x = country, y = donors))
p + geom_boxplot() + coord_flip()
## Warning: Removed 34 rows containing non-finite values (stat_boxplot).

See the order of faceting plots and box plots.

More clever way is to have the countries listed from high to low average donation rate. Think about reordering the country variable by the mean of donors.

reorder() has two arguments: 1. the categorical variable or factor that we want to reorder 2. by what? need to indicate that variable If you only give reorder() the first two required arguments, then by default it will reorder the categories of your first variable by the mean value of the second

p <- ggplot(data = organdata, 
            mapping = aes(x = reorder(country, donors), 
                          y = donors))
p + geom_boxplot() + coord_flip()
## Warning: Removed 34 rows containing non-finite values (stat_boxplot).

Well, it did not work out well. See:

vec1 <- c(1,2,3)
vec2 <- c(4,5,NA)
mean(vec1)
## [1] 2
mean(vec2)
## [1] NA

In R, the default mean function will fail with an error if there are missing values in the variable you are trying to take the average of. ou must say that it is OK to remove the missing values when calculating the mean. This is done by supplying the na.rm=TRUE argument to reorder(), which internally passes that argument on to mean().

mean(vec2, na.rm = TRUE)
## [1] 4.5
p <- ggplot(data = organdata, 
            mapping = aes(x = reorder(country, donors, na.rm = TRUE), 
                          y = donors))
p + geom_boxplot() + coord_flip()
## Warning: Removed 34 rows containing non-finite values (stat_boxplot).

p <- ggplot(data = organdata, 
            mapping = aes(x = reorder(country, donors, na.rm = TRUE), 
                          y = donors,
                          fill = world))
p + geom_boxplot() + coord_flip() + theme(legend.position = "bottom") +
  labs(x = NULL)
## Warning: Removed 34 rows containing non-finite values (stat_boxplot).

Just a matter of representation

p <- ggplot(data = organdata, 
            mapping = aes(x = reorder(country, donors, na.rm = TRUE), 
                          y = donors,
                          color = world))
p + geom_point() + coord_flip() + theme(legend.position = "bottom") +
  labs(x = NULL)
## Warning: Removed 34 rows containing missing values (geom_point).

p <- ggplot(data = organdata, 
            mapping = aes(x = reorder(country, donors, na.rm = TRUE), 
                          y = donors,
                          color = world))
p + geom_point(alpha = .5) + coord_flip() + theme(legend.position = "bottom") +
  labs(x = NULL)
## Warning: Removed 34 rows containing missing values (geom_point).

geom_jitter works much like geom_point(), but randomly nudges each observation by a small amount.

p <- ggplot(data = organdata, 
            mapping = aes(x = reorder(country, donors, na.rm = TRUE), 
                          y = donors,
                          color = world))
p + geom_jitter() + coord_flip() + theme(legend.position = "bottom") +
  labs(x = NULL)
## Warning: Removed 34 rows containing missing values (geom_point).

p <- ggplot(data = organdata, 
            mapping = aes(x = reorder(country, donors, na.rm = TRUE), 
                          y = donors,
                          color = world))
p + geom_jitter(position = position_jitter(width = .15)) + coord_flip() + theme(legend.position = "bottom") +
  labs(x = NULL)
## Warning: Removed 34 rows containing missing values (geom_point).

Cleveland doptplot

When we want to summarize a categorical variable that just has one point per category, we should use this approach as well. Think about this:

by_country <- organdata %>% group_by(consent_law, country) %>%
  summarize(donors_mean = mean(donors, na.rm = TRUE),
            donors_sd = sd(donors, na.rm = TRUE),
            gdp_mean = mean(gdp, na.rm = TRUE),
            health_mean = mean(health, na.rm = TRUE),
            roads_mean = mean(roads, na.rm = TRUE),
            cerevbas_mean = mean(cerebvas, na.rm = TRUE))
## `summarise()` has grouped output by 'consent_law'. You can override using the
## `.groups` argument.
by_country
## # A tibble: 17 × 8
## # Groups:   consent_law [2]
##    consent_law country        donors_m…¹ donor…² gdp_m…³ healt…⁴ roads…⁵ cerev…⁶
##    <chr>       <chr>               <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
##  1 Informed    Australia            10.6   1.14   22179.   1958.   105.     558.
##  2 Informed    Canada               14.0   0.751  23711.   2272.   109.     422.
##  3 Informed    Denmark              13.1   1.47   23722.   2054.   102.     641.
##  4 Informed    Germany              13.0   0.611  22163.   2349.   113.     707.
##  5 Informed    Ireland              19.8   2.48   20824.   1480.   118.     705.
##  6 Informed    Netherlands          13.7   1.55   23013.   1993.    76.1    585.
##  7 Informed    United Kingdom       13.5   0.775  21359.   1561.    67.9    708.
##  8 Informed    United States        20.0   1.33   29212.   3988.   155.     444.
##  9 Presumed    Austria              23.5   2.42   23876.   1875.   150.     769.
## 10 Presumed    Belgium              21.9   1.94   22500.   1958.   155.     594.
## 11 Presumed    Finland              18.4   1.53   21019.   1615.    93.6    771.
## 12 Presumed    France               16.8   1.60   22603.   2160.   156.     433.
## 13 Presumed    Italy                11.1   4.28   21554.   1757    122.     712.
## 14 Presumed    Norway               15.4   1.11   26448.   2217.    70.0    662.
## 15 Presumed    Spain                28.1   4.96   16933    1289.   161.     655.
## 16 Presumed    Sweden               13.1   1.75   22415.   1951.    72.3    595.
## 17 Presumed    Switzerland          14.2   1.71   27233    2776.    96.4    424.
## # … with abbreviated variable names ¹​donors_mean, ²​donors_sd, ³​gdp_mean,
## #   ⁴​health_mean, ⁵​roads_mean, ⁶​cerevbas_mean
p <- ggplot(data = by_country,
            mapping = aes(x = donors_mean,
                          y = reorder(country, donors_mean),
                          color = consent_law))
p + geom_point() 

p <- ggplot(data = by_country,
            mapping = aes(x = donors_mean,
                          y = reorder(country, donors_mean),
                          color = consent_law))
p + geom_point(size = 3) +
  labs(x = 'Donor Procurement Rarte', y = "",
       color = "Consent Law") +
  theme(legend.position = "bottom")

If you don’t like, you might facet them

p <- ggplot(data = by_country,
            mapping = aes(x = donors_mean,
                          y = reorder(country, donors_mean)))
                          
p + geom_point(size = 3) + 
    facet_wrap( ~ consent_law) +
    labs(x = 'Donor Procurement Rarte', y = "")

p <- ggplot(data = by_country,
            mapping = aes(x = donors_mean,
                          y = reorder(country, donors_mean)))
                          
p + geom_point(size = 3) + 
    facet_wrap( ~ consent_law, scales = "free_y") +
    labs(x = 'Donor Procurement Rarte', y = "")

p <- ggplot(data = by_country,
            mapping = aes(x = donors_mean,
                          y = reorder(country, donors_mean)))
                          
p + geom_point(size = 3) + 
    facet_wrap( ~ consent_law, ncol = 1) +
    labs(x = 'Donor Procurement Rarte', y = "")

Optional: if you want to delve deeper

p <- ggplot(data = by_country,
            mapping = aes(x = reorder(country, donors_mean),
                          y = donors_mean))
                          
p + geom_pointrange(mapping = aes(ymin = donors_mean - donors_sd,
                                  ymax = donors_mean + donors_sd)) + 
    labs(x = '', y = "Donor Procurement Rate") + coord_flip()

Plot Text

Sometimes text itself provides clear information.

p <- ggplot(data = by_country,
            mapping = aes(x = roads_mean, y = donors_mean))
p + geom_point() + geom_text(mapping = aes(label = country))

Make it to the left

p <- ggplot(data = by_country,
            mapping = aes(x = roads_mean, y = donors_mean))
p + geom_point() + geom_text(mapping = aes(label = country), hjust = 0)

Make it to the right

p <- ggplot(data = by_country,
            mapping = aes(x = roads_mean, y = donors_mean))
p + geom_point() + geom_text(mapping = aes(label = country), hjust = 1)

hjust will often fail because the space is added in proportion to the length of the label. The result is that longer labels move further away from their points than you want.

library(ggrepel)
p <- ggplot(data = by_country,
            mapping = aes(x = roads_mean, y = donors_mean))
p + geom_point() + geom_text_repel(mapping = aes(label = country))

It has subset function, useful for outliers

p <- ggplot(data = by_country,
            mapping = aes(x = gdp_mean, y = health_mean))
p + geom_point()

p <- ggplot(data = by_country,
            mapping = aes(x = gdp_mean, y = health_mean))
p + geom_point() + 
  geom_text_repel(data = subset(by_country, gdp_mean > 25000),
                  mapping = aes(label = country))

p <- ggplot(data = by_country,
            mapping = aes(x = gdp_mean, y = health_mean))
p + geom_point() + 
  geom_text_repel(data = subset(by_country, 
                                gdp_mean > 25000 | health_mean < 1500 |
                                country %in% "Belgium"),
                  mapping = aes(label = country))

Comment on the following codes 1

p <- ggplot(organdata, aes(roads, donors))
p + geom_point() + annotate(geom = "text", x = 91, y = 33,
                            label = "You said class is over",
                            hjust = 0)
## Warning: Removed 34 rows containing missing values (geom_point).

p <- ggplot(organdata, aes(roads, donors))
p + geom_point() + 
  annotate(geom = "rect", xmin = 125, xmax = 155,
           ymin = 30, ymax = 35, fill = "red", alpha = 0.2) +
  annotate(geom = 'text', x = 157, y = 33,
           label = "Why isn't over?", hjust = 0)
## Warning: Removed 34 rows containing missing values (geom_point).

Comment on the following codes 2

p <- ggplot(data = organdata,
            mapping = aes(x = roads, y = donors, color = world))
p + geom_point()
## Warning: Removed 34 rows containing missing values (geom_point).

p <- ggplot(data = organdata,
            mapping = aes(x = roads, y = donors, color = world))
p + geom_point() +
  scale_x_log10()+
  scale_y_continuous(breaks = c(5,15,25),
                    labels = c("Five","Fifteen","Twenty Five"))
## Warning: Removed 34 rows containing missing values (geom_point).

p <- ggplot(data = organdata,
            mapping = aes(x = roads, y = donors, color = world))
p + geom_point() +
  scale_color_discrete(
                    labels = c("Corp","Liberal","Social Demo", "No classified"))+
  labs(x = "Road Deaths",
       y = "Donor Procurement",
       color = "Welfare State")
## Warning: Removed 34 rows containing missing values (geom_point).

p <- ggplot(data = organdata,
            mapping = aes(x = roads, y = donors, color = world))
p + geom_point() +
  labs(x = "Road Deaths",
       y = "Donor Procurement")+
       guides(color ="none")
## Warning: Removed 34 rows containing missing values (geom_point).